Using large language models for safety-related table summarization in clinical study reports

Abstract
Objectives: The generation of structured documents for clinical trials is a promising application of large language models (LLMs). We share opportunities, insights, and challenges from a competitive challenge that used LLMs for automating clinical trial documentation.
Materials and Methods: As part of a challenge initiated by Pfizer (the organizer), several teams (the participants) created a pilot for generating summaries of safety tables for clinical study reports (CSRs). Our evaluation framework used automated metrics and expert reviews to assess the quality of AI-generated documents.
Results: The comparative analysis revealed differences in performance across solutions, particularly in factual accuracy and lean writing. Most participants employed prompt engineering with generative pre-trained transformer (GPT) models.
Discussion: We discuss areas for improvement, including better ingestion of tables, addition of context, and fine-tuning.
Conclusion: The challenge results demonstrate the potential of LLMs in automating table summarization in CSRs while also revealing the importance of human involvement and continued research to optimize this technology.


Background and significance
The clinical study report (CSR) is a highly structured document that follows the format outlined in the ICH E3 guideline [1,2]. One time-intensive aspect of preparing a CSR is the review and description of safety data [3].
LLMs are artificial neural networks [4] that achieve text generation capabilities by leveraging massive amounts of data to learn billions of parameters during training [5,6]. They are believed to acquire knowledge regarding syntax, semantics, and the underlying "ontology" of human language [7,8]. The success of ChatGPT in passing the United States Medical Licensing Exam (USMLE) [9-11] signals a potential breakthrough in LLMs' ability to generate clinical insights.
One challenging aspect of automating CSR creation is extracting relevant information from tables. Clinical meaningfulness is of utmost importance, and achieving it often requires supplementary information such as clinical expertise and the study protocol. Furthermore, inferences may emerge from connections across tables [12].
There is currently no software for CSR generation that uses LLMs as the main engine. A challenge was organized to examine what can be achieved using this technology. Participants were blinded to each other's solutions so that each could apply its own capabilities independently and without bias. To define a scope appropriate for a 6-week challenge [13], this experiment focused on the CSR Safety Summary section only, for a single therapeutic area, namely Inflammation and Immunology. Participants were challenged to generate summary text for the sub-sections on Adverse Events, Deaths, Laboratory Results, Vital Signs, Electrocardiograms, and Physical Examination Findings.

Methods
The challenge was conducted between August 16 and October 5, 2023, with the initial call for submissions receiving a positive response from 23 external business entities from the United States, India, Germany, Ireland, France, Israel, the United Kingdom, and the Czech Republic. Based on these initial written proposals, six entities (technology companies of varied size) were selected to participate in the challenge. The text of the challenge statement can be found in Supplementary materials. Participants were not compensated but competed for the opportunity to collaborate with the organizer in the future.
Safety outputs of 72 CSRs from recently completed studies were identified for the training and test sets. These data are highly representative of what is currently being used by clinical subject matter experts (SMEs) to prepare a CSR. Tables were supplied in the exact format that is currently used (HTML for in-text tables, PDF for out-of-text tables). The training set included studies from phase 1 to phase 3 trials; 58% of the studies were phase 1 and 42% were phase 2 or 3. In total, it included CSRs from 17 different drug assets covering a wide variety of safety-related events.
The CSRs were divided into two sets: 70% were used for model training and the remaining 30% were reserved for testing. Training data included the CSR body text, safety summary data tables, protocols, and the safety narrative plans. Testing data included only the tables, protocols, and safety narrative plans. The task was to generate the text. No individual subject data were provided.
The models were developed by challenge participants over the course of 6 weeks using the training set and additional data provided by the organizer. Following this, the test set of tables (from 22 CSRs) was released to the participants, and they were required to produce the safety section of the CSRs within 24 hours. The model output was evaluated by the organizer team, blinded to participant names.
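As a worked illustration of the split arithmetic, the sketch below divides 72 CSR identifiers 70%/30%, which reproduces the 22 test CSRs mentioned above. The seeded random shuffle, the function name `split_csrs`, and the identifier format are assumptions made for illustration; the text does not state how CSRs were assigned to each set.

```python
import random


def split_csrs(csr_ids, test_fraction=0.30, seed=0):
    """Split CSR identifiers into (train, test) lists.

    Assumption: assignment is a seeded random shuffle. The actual
    assignment procedure used in the challenge is not described.
    """
    ids = list(csr_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)  # 72 * 0.30 = 21.6, rounds to 22
    return ids[n_test:], ids[:n_test]


# Hypothetical identifiers for the 72 CSRs.
csr_ids = [f"CSR-{i:02d}" for i in range(1, 73)]
train_ids, test_ids = split_csrs(csr_ids)
print(len(train_ids), len(test_ids))  # 50 22
```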

Environment and technical ground rules
The challenge was carried out by participant teams in a private, multi-tenant compute workspace set up by the organizer on the Databricks platform, using a g5.24xlarge instance with four graphics processing units (GPUs). This environment provided personalized access and ensured data isolation within a shared infrastructure. Teams had access to GPT-3.5-turbo and earlier versions. Fine-tuning was permitted on non-GPT, locally hosted models only.
Participants were evaluated according to three criteria domains:
(1) Technical score: An evaluation to assess factual accuracy and text similarity scores via comparison to original CSRs.
(2) Business score: An evaluation to assess overall usability of AI-generated text for business users, based on lean writing (concise, inferential, and relevant statements) and provenance (data traceability and extent of hallucination).
(3) Implementation score: An evaluation to assess the teams' presentations on the dimensions of technical approach, scalability, demo, and usability.
Raters consisted of a multi-disciplinary team of 17 organizer members including data scientists, clinical statisticians, and medical writers (see challenge statement in Supplementary material) [6].

Technical score
This included automated text evaluation scores and factual accuracy ratings. Automated metrics were text similarity scores comparing model output with original CSR text: ROUGE-1 and ROUGE-L [14,15]. Numeric similarity was quantified by considering numeric values in the original CSR text and model output as two sets and calculating the Jaccard coefficient [16]. Further, based on the original CSRs, the fraction of specific keywords (eg, unique safety issues within the study) present in the text was determined. Finally, semantic similarity was evaluated using GPT-4 in a fashion similar to GPT-score [17] and G-Eval [18], by prompting GPT-4 to count the number of facts in the original CSR text (O) and the number of facts in the model output text that have the same semantic meaning as facts in the original text (M). The fraction M/O was used as a metric of semantic similarity. All scores were scaled to range 0-1. The mean of all automated text metrics was used for further analysis.
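The automated metrics described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the organizer's implementation: the tokenizer, the use of the F1 variant of ROUGE, the regular expression for numeric values, the substring-based keyword match, and the clamping of M/O to 1 are all choices made here for the example. The fact counts M and O are taken as inputs, since in the challenge they were produced by prompting GPT-4.

```python
import re
from collections import Counter

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def tokenize(text):
    """Lowercase alphanumeric tokens (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def _f1(overlap, n_reference, n_candidate):
    if overlap == 0:
        return 0.0
    precision, recall = overlap / n_candidate, overlap / n_reference
    return 2 * precision * recall / (precision + recall)


def rouge_1_f(reference, candidate):
    """Unigram-overlap F1 between original text and model output."""
    ref, cand = Counter(tokenize(reference)), Counter(tokenize(candidate))
    overlap = sum((ref & cand).values())
    return _f1(overlap, sum(ref.values()), sum(cand.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(reference, candidate):
    """Longest-common-subsequence F1 between original text and model output."""
    ref, cand = tokenize(reference), tokenize(candidate)
    return _f1(lcs_length(ref, cand), len(ref), len(cand))


def numeric_jaccard(reference, candidate):
    """Jaccard coefficient of the sets of numeric strings in both texts."""
    ref, cand = set(NUMBER.findall(reference)), set(NUMBER.findall(candidate))
    if not ref and not cand:
        return 1.0  # assumption: two texts without numbers count as identical
    return len(ref & cand) / len(ref | cand)


def keyword_fraction(keywords, candidate):
    """Fraction of keywords found in the model output (substring match)."""
    if not keywords:
        return 0.0
    text = candidate.lower()
    return sum(k.lower() in text for k in keywords) / len(keywords)


def semantic_similarity(matched_facts, original_facts):
    """M/O, clamped to 1 (the clamp is an assumption, not from the text)."""
    if original_facts == 0:
        return 0.0
    return min(1.0, matched_facts / original_facts)


# Hypothetical sentences, invented for illustration only.
original = "Headache was reported in 12 (24.0%) participants."
output = "Headache occurred in 12 participants (25.0%)."
print(rouge_1_f(original, output), rouge_l_f(original, output))
print(numeric_jaccard(original, output))
```

In this example the two texts share the value 12 but disagree on the percentage, so the numeric overlap is penalized even though most words match, which is the behavior the Jaccard-based metric is meant to capture.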
Factual accuracy ratings were performed manually by a team of raters. For each claim, factual accuracy was determined based on whether the claim is supported by the table data [12,19,20]. All scores were scaled to range 0-1. The mean factual accuracy score across CSRs was used for further analysis. The mean of text metrics and factual accuracy constituted the technical score.
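The aggregation into a technical score can be illustrated as follows. This is a sketch under assumptions: each claim is treated as a binary supported/unsupported rating, the per-CSR factual accuracy is the fraction of supported claims, and the technical score is the unweighted mean of the two components. The function names and the example values are hypothetical.

```python
from statistics import mean


def factual_accuracy(claim_supported):
    """Fraction of claims in one CSR rated as supported by the table data.

    Assumption: each manual rating is binary (supported or not).
    """
    return mean(1.0 if supported else 0.0 for supported in claim_supported)


def technical_score(text_metrics, factual_by_csr):
    """Unweighted mean of (mean text metric, mean factual accuracy).

    Both inputs are expected to be already scaled to the range 0-1.
    """
    text_mean = mean(text_metrics.values())
    factual_mean = mean(factual_by_csr)
    return mean([text_mean, factual_mean])


# Hypothetical values, invented for illustration only.
text_metrics = {
    "rouge_1": 0.5,
    "rouge_l": 0.4,
    "number_overlap": 0.6,
    "keywords": 0.7,
    "semantic_similarity": 0.3,
}
factual_by_csr = [
    factual_accuracy([True, True, False, True]),
    factual_accuracy([True, False]),
]
score = technical_score(text_metrics, factual_by_csr)
print(round(score, 4))
```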

Business score
This assessment was conducted by SMEs (organizer medical writing team) on the dimensions of Lean Writing and Provenance, which in turn consist of four and three items, respectively. The Lean Writing score evaluates inclusion of summary statements, presence of excessive repetitiveness, inclusion of inferential statements, and relevance of the provided text [22,23]. The scoring sheet can be found in Table S1, Supplementary material.

Implementation
In addition to text outputs, participants gave presentations outlining their approach, as well as a demo. They described their plans, should they enter into collaboration with the organizer. Each presentation was rated on the dimensions of Technical Approach, Scalability, Demo, and Usability. For each of the dimensions, raters were given specific pointers to assess. See Supplementary Table S2 for details. The final score is the average of all scores after scaling to a range 0-1.

Results
A key task in the CSR generation process is the ability to extract facts from study tables and listings and reformulate that information into concise, accurate text [24]. Challenge participants approached this through diverse ingestion methods with varied success. Outputs often showed excellent comprehension of table structure, for example, distinguishing between treatment arms, although occasional parsing errors were observed.
The evaluation scores revealed differences in performance across the six teams (Figure 1). The teams diverged most in Factual Accuracy, indicating variability in precision of information prioritized for generation. Use of keywords and semantic similarity also varied widely, highlighting contrasts in utilizing relevant terms to inform study-specific safety issues and aligning content with standards. Similarly, teams' scores demonstrated marked disparities in the domain-specific skills of Lean Writing and Provenance (see Supplementary material). On the other hand, metrics like ROUGE-1, ROUGE-L, and Number Overlap showed a narrower range of scores, pointing to a baseline competency shared by all teams in unigram matching and sequence prediction. This stratification of results highlights the variability in different approaches employed by participants.
Most teams in the challenge used the approach shown in Figure 2. To extract tables from CSRs, teams employed different approaches such as GPT, regular expressions, and a combination of automated tools with human oversight to ensure precise data capture. Participants used a variety of strategies for the prompt engineering stage to enhance model performance, ranging from sophisticated filtering algorithms to the application of strict inclusion/exclusion criteria and the use of arithmetic logic to draw inferences (see Supplementary material for example prompts and solutions outline). Lastly, in the score results stage, we assessed the generated text outputs using a single score or a combination of metrics, addressing various aspects of summary quality. Variation was seen in the timing and level of involvement of a human expert in the loop. While some participants allowed humans to intervene at intermediate steps including table parsing (which led to much greater data extraction accuracy), others limited human feedback to prompt engineering only.
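To make the table ingestion and prompt construction steps concrete, the sketch below parses an HTML table into rows and embeds a plain-text rendering in a prompt. It is not any participant's actual solution: the class `TableRows`, the helper names, the prompt wording, and the example table are all hypothetical, and the parser ignores merged cells and nested tables that real CSR tables may contain.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None


def table_to_text(html):
    """Render an HTML table as pipe-separated lines, one per row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return "\n".join(" | ".join(row) for row in parser.rows)


def build_prompt(table_text):
    """Hypothetical prompt wording; real prompts are in the Supplementary material."""
    return (
        "Summarize the safety table below for a clinical study report. "
        "Use only values that appear in the table.\n\n" + table_text
    )


# Invented example table, for illustration only.
example_html = (
    "<table><tr><th>Event</th><th>Drug (N=50)</th><th>Placebo (N=50)</th></tr>"
    "<tr><td>Headache</td><td>12 (24.0%)</td><td>5 (10.0%)</td></tr></table>"
)
print(build_prompt(table_to_text(example_html)))
```

Keeping the parsed rows as an intermediate artifact is one way to give a human reviewer a checkpoint before prompting, in line with the observation that human intervention at the table parsing step improved extraction accuracy.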

Discussion
The challenge helped test generative AI and understand the opportunities and challenges with this technology for productivity improvement in pharma's clinical development process. Central to this initiative, we developed a comprehensive evaluation framework for AI-generated output that employed a blend of automated metrics and SME reviews to assess document quality. This multifaceted approach not only ensured a robust validation, but also set a benchmark for scoring that is adaptable to related applications.
One limitation of the current study is the GPT version that was used, and the lack of fine-tuning capability for GPT models in our environment. The current Generative AI challenge environment was restricted to GPT-3.5-turbo and smaller non-GPT models. GPT-4 [25], which became available late in the challenge, has a longer context, is more steerable using personas, and is less likely to fabricate facts [26,27]. GPT-4 will be tested in future work.
[29,30] One team performed fine-tuning on a FLAN-T5-XL model [31], which showed higher ROUGE scores compared to prompt engineering. With more training data, this could be pursued in the future. Conversely, another team performed fine-tuning on a LLaMA 7B model [32], which gave a lift in quantitative metrics but showed hallucinations and erroneous summaries. Possible reasons include the small size of the training set [33] and the mix of one-to-one and many-to-many table-to-summary mappings [34]. While fine-tuning GPT models shows promise, large-scale deployment will require careful evaluation to ensure cost-effectiveness.
Aspects such as ranking of facts by importance and inference cannot at present be automated, and therefore require SMEs [35]. In evaluating model performance, factual accuracy should weigh higher than n-gram score comparisons to original CSR text [40,41]. We expect productivity gains from using LLMs of 20% time savings in the near term and up to 50% after additional development and integration into business processes. The impact will increase with inclusion of individual subject data. Broader use cases include additional CSR sections and therapeutic areas. Implementation of LLMs in production may encounter surmountable roadblocks such as the need for workflow adaptation, running LLMs in a secure environment, table standardization, added context, and dealing with LLM system time-outs, for example, by using asynchronous LLM calls. Finally, proactive communication with regulators is essential to establish a clear understanding of the regulatory pathway.
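One way to handle LLM time-outs with asynchronous calls is sketched below using Python's `asyncio`. The function `call_llm` is a hypothetical stand-in for a real client call, and the time-out, retry count, backoff, and concurrency limit are illustrative assumptions rather than settings used in the challenge.

```python
import asyncio


async def call_llm(prompt):
    """Hypothetical stand-in for a real LLM client call."""
    await asyncio.sleep(0.01)  # simulates network and generation latency
    return f"summary of: {prompt}"


async def summarize_with_timeout(prompt, llm=call_llm, timeout_s=30.0, retries=2):
    """Call the LLM, retrying on time-out; return None if all attempts fail."""
    for attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(llm(prompt), timeout=timeout_s)
        except asyncio.TimeoutError:
            if attempt == retries:
                return None
            await asyncio.sleep(0.01 * 2 ** attempt)  # short exponential backoff


async def summarize_all(prompts, max_concurrent=4, **kwargs):
    """Summarize many tables concurrently, limiting simultaneous calls."""
    gate = asyncio.Semaphore(max_concurrent)

    async def worker(prompt):
        async with gate:
            return await summarize_with_timeout(prompt, **kwargs)

    # gather preserves the order of the input prompts.
    return await asyncio.gather(*(worker(p) for p in prompts))


results = asyncio.run(summarize_all(["table 1", "table 2"]))
print(results)
```

Returning `None` for a table that repeatedly times out, instead of raising, lets the remaining tables complete and leaves the failed ones to be retried or routed to a human writer.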
In summary, this challenge demonstrated the potential for using large language models to automate safety-related table summarization in CSRs, while also highlighting areas for improvement. Key learnings include the need for human involvement, especially by SMEs, to ensure accuracy and relevance. Continued research into integrating different AI methods with interactive human oversight will be an important step toward realizing the potential of this technology.

Figure 1. Scores in technical and business domain: (left to right) factual accuracy, automated similarity metrics ROUGE-1, ROUGE-L, number overlap, presence of keywords and semantic similarity, business domain scores lean writing and provenance.

Figure 2. Approximate workflow used by most teams in the challenge.